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1. INTRODUCTION 

It is our goal to improve the efficiency of compiling variational forms with FFC, the 
FEniCS Form Compiler, previously presented in [Kirby and Logg 2006]. FFC auto- 
matically generates efficient low-level code for evaluating a wide class of multilinear 
variational forms associated with finite element methods [Ciarlet 1976; Hughes 
1987; Brenner and Scott 1994; Eriksson et al. 1996] for partial differential equa- 
tions. However, the efficiency of FFC decreases rapidly with the complexity of the 
variational form and polynomial degree. In this paper, we investigate the core algo- 
rithms of FFC and rephrase them so as to diminish interpretive overhead and make 
better use of optimized numerical libraries. We thus wish to decrease the run-time 
for the form compiler, corresponding to a reduced compile-time for a finite element 
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code. This becomes particularly important when FFC is used as a just-in-time 
compiler to generate and compile code on the fly at run-time. 

1.1 FFC, the FEniCS Form Compiler 

FFC [Logg 2006] was first released in 2004 as a prototype compiler for variational 
forms, automating a key step in the implementation of finite element methods. [Logg 
2004]. Given a multilinear variational form and an affinely mapped simplex, FFC 
automatically generates low-level code for evaluation of the variational form (assem- 
bly of the associated linear system). More precisely, FFC generates efficient low- 
level code for computation of the element tensor (element stiffness matrix), based 
on the novel approach of representing the element tensor as a special tensor con- 
traction presented earlier in [Kirby et al. 2005, SISC] and [Kirby ct al. 2004]. FFC 
is implemented in Python and provides both a command-line and Python interface 
for the specification of variational forms in a syntax very close to the mathematical 
notation. Together with other components of the FEniCS project [Hoffman et al. 
2006, FEniCS], such as FIAT [Kirby 2006a; 2004; 2006b] and DOLFIN [Hoffman 
et al. 2006, DOLFIN], FFC automates some central aspects of the finite element 
method. 

There exist today a number of competing efforts that strive to automate the 
finite element method. One such example is Sundance [Long 2003], which similarly 
to FFC provides a system for automated assembly /evaluation of variational forms 
given in mathematical notation. A main difference between Sundance and FFC is 
that Sundance provides a run-time system for parsing and evaluation of variational 
forms, whereas FFC precomputes important quantities at compile-timc, which in 
many cases allows for generation of more efficient code for the run-time assembly 
of linear systems. Automated assembly from a high-level specification of a varia- 
tional form is also supported by FreeFEM [Pironneau et al. 2006], GetDP [Dular 
and Geuzaine 2006] and Analysa [Bagheri and Scott 2003], which all implement 
domain-specific languages for specification and implementation of finite element 
methods for partial differential equations. Other projects such as Diffpack [Lang- 
tangen 1999] and deal. II [Bangerth et al. 2006] provide sophisticated libraries aid- 
ing implementation of finite element methods. These libraries provide tools such as 
meshes, ordering of degrees of freedom, and interfaces to solvers, but do not provide 
automated evaluation of variational forms. 

Since the first release of FFC, a number of improvements have been made, mostly 
improving on the functionality of the compiler. New features that have been added 
since the work [Kirby and Logg 2006] include support for mixed finite element 
formulations, an extension of the form language to include linear algebra oper- 
ations such as inner products and matrix-vector products, differential operators 
such as the gradient, divergence and rotation, local projections between finite ele- 
ment spaces and an option to generate code in terms of level 2 BLAS operations. 
FFC also functions as a just-in-time compiler for PyDOLFIN, the Python interface 
of DOLFIN [Hoffman ct al. 2006, DOLFIN]. 

However, the performance of FFC has been suboptimal, potentially lengthening 
development cycles for high-order simulations in three dimensions. Additionally, 
this inefficiency would inhibit the embedding of FFC in a run-time system, such as 
Sundance, as a just-in-time compiler. 
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1.2 Main results 

The main purpose of this paper is twofold. First, we extend and formalize our par- 
ticular representation of multilinear variational forms. This involves writing each 
form as a sum of monomials, which are integrals of products of (derivatives of) the 
basis functions. Hence, we prove a representation theorem showing that the eval- 
uation of any monomial form is equivalent to a contraction of a reference tensor 
with a geometry tensor. This makes precise what we have discussed in previous 
work [Kirby et al. 2005; Kirby and Logg 2006]. Second, as evaluating the refer- 
ence tensor is a dominant cost of FFC, we discuss how to improve the efficiency of 
building the reference tensor for each monomial by rewriting the computation in 
terms of operations that may be performed by optimized libraries and by hoisting 
loop invariants. The loop hoisting is non-trivial since the depth of the loop nesting 
is not known until the code is executed. The speedups gained with the new algo- 
rithm for a series of test cases range between one and three orders of magnitude. 
Furthermore, we introduce the concept of signatures for monomial forms to allow 
factorization of common monomial terms when evaluating a given multilinear form. 

1.3 Outline of this paper 

In Section 2, we first derive a representation for the element tensor as a contraction 
of two tensors, which is at the core of the implementation of FFC. We then, in 
Section 3, discuss different approaches to precomputation of the monomial integrals 
that appear in this tensor representation, concluding that a suitable rearrangement 
of the computation can lead to significant improvement in performance. 

To test the new algorithm, we present in Section 4 benchmark results for a series 
of test cases, comparing the latest version of FFC with a previous version. Finally, 
we summarize our findings in Section 5. 

2. EVALUATION OF MULTILINEAR FORMS 

We review here the basic idea of tensor representation of multilinear variational 
forms, as first presented in [Kirby ct al. 2005; Kirby and Logg 2006], and derive a 
representation theorem for a general class of multilinear forms. 

At the core of finite element methods for partial differential equations is the 
assembly of a linear system from a given bilinear form. In general, we consider a 
multilinear form 

a : V x X V 2 X • • • X V r ->• M, (1) 

defined on the product space V\ x V% x • • • x V r of a given set {T^}[ =1 of discrete 
function spaces on a triangulation T of a domain !lcl' i . 

Typically, r = 1 (linear form) or r = 2 (bilinear form), but the form compiler FFC 
can handle multilinear forms of arbitrary arity r. In the simplest case, all function 
spaces are equal but there are many important examples, such as mixed methods, 
where it is important to consider arguments coming from different function spaces. 

2.1 The element tensor 

Let b c bases of V 1 ,V 2 ,...,V r and let i = 

(jx, 12, ■ ■ ■ , i r ) be a multiindex. The multilinear form a then defines a rank r tensor, 
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given by 

Ai = a(4,^ 2 ,...,<C)- ( 2 ) 
In the case of a bilinear form, the tensor A is a matrix (the stiffness matrix), and 
in the case of a linear form, the tensor A is a vector (the load vector). 

As discussed in [Kirby and Logg 2006] , to compute the tensor A by assembly, we 
need to compute the element tensor A K on each clement K of the triangulation 
T of f2. With {4>f ,1 } 7 i=i the restriction to K of the subset of \jt>Y\f=i supported 
on K, = spanl^' 1 }™^ and the local spaces , . . . , defined similarly, we 
need to evaluate the rank r element tensor A K , given by 

Af =a K (^ 1 ,^ 2 ,...,^ r ) Wei, (3) 

where ok is the local contribution to the given multilinear form a on the element K 
and where X is the index set 

r 

I =I1[ 1 'I^ A 'I] = {( 1 ' 1 ' (M. ...,2),...,(ni,n2,...,nr)}. (4) 

i=i 

We restrict our discussion to multilinear forms that may be written as a sum over 
terms consisting of integrals over fl of products of derivatives of functions from sets 
of discrete spaces {T^}[ =1 . We call such terms monomials. For one such term, the 
element tensor takes the following (preliminary) form: 




where the subscript ij (i) picks out a basis function from the restriction of Vj to K 
for the current multiindex i and where Sj is the multiindex for the corresponding 
derivative. For sequences of multiindices such as {5j}j =1 , we use the convention 
that Sjk denotes the fcth element of the jth multiindex for ft — 1, 2, . . . , \Sj\. 

To explain the notation, we consider a couple of illustrative examples. First, 
consider the bilinear form a(v,u) — J n vudx corresponding to a mass matrix. The 
element tensor (matrix) is then given by 

Af= I < 4 <' 2 dx, (6) 
Jk 

and so, in the notation of (5), we have r — 2, Lj(i) = ij and Sj = for j = 1,2, 
where denotes an empty multiindex (basis function is not differentiated). 

Next, we consider the bilinear form a(v,u) = j n dx with corresponding 

element tensor given by 

r d 2 6 K > 2 

which wc can phrase in the notation of (5) with r = 2, Lj(i) = ij, 5% — and 
5 2 = (1,1)- 

More generally, variational problems involve sums over monomial terms, each of 
which may include a spatially varying coefficient. We express such a coefficient in 
a finite element basis as y"_ T c 7 </> 7 . Also, the function spaces involved may each 
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be vector- valued. While wc might reduce this situation to a collection of cases of 
the form (5), we instead extend our canonical form. This allows us to write the 
element tensor as 

« m 

a ? = e / n^-w^x^MT)]^, (8) 

iec jK j=i 

with summation over some appropriate index set C, where Kji^f) denotes a compo- 
nent index for factor j depending on 7. To distinguish a component index from a 
basis function index, we here use [•] to denote a component index. Note that the 
number of factors m may be different from the rank r of the element tensor (arity 
of the form) . 

To illustrate the notation, we consider the bilinear 1 form on V\ x V2 for the 
weighted vector-valued Poisson's equation with given variable coefficient function 
w £ V 3 : 



d 

a(v,u) = I w\7v : Vu da; = / w L ,1J 1 111 dx. (9) 

Jn 71=l72=1 ^n ^72 ^72 

The corresponding clement tensor is given by 

[7i] dcj) l2 ' [71] K 



dv[ji] du[ji} 



71=172=173=1 72 



with {if^ji^Li the expansion coefficients for w in the local basis on K for V3. 
In the notation of (8), wc thus have r — 2, m = 3, t(i,7) = (i 1 , *2 5 73)) ^(7) = 
(72,72,0), «(t) = (7i,7i,0) and Cj-(7) = (1,1, w^). The index set C is given by 
C = {(71,72,73)} =[l,d] 2 x[l,|^|]. 

One may argue that the canonical form (8) may always be reduced to the simpler 
form (5) by considering any given element tensor as a suitable transformation (sum- 
mation and multiplication with coefficients) of basic element tensors of the simple 
form (5). However, we shall consider the more general form (8) since it increases 
the granularity of the operation of computing the reference element tensor, which is 
the operation we set out to optimize. It is also a more flexible representation in that 
it allows us to directly express the element tensor for a wider range of multilinear 
forms such as that for the vector-valued Poisson's equation. 

2.2 Representing the element tensor as a tensor contraction 

We may write the element tensor for any (affine) element if as a contraction of one 
tensor depending only on the form and function spaces with another depending only 
on the geometry and coefficients in the problem. This is accomplished by affinely 
transforming from K to & reference element K . In order to show this, we first 
state some basic results providing a notational framework for our representation 
theorem. 



1 Note that we may alternatively consider this to be a trilinear form, if we think of the coefficient w 
as a free argument to the form and not as a fixed given function. 
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We first make the following observation about interchanging product and sum- 
mation. 

Lemma 1 (Interchanging product and summation). With 
J\i J%i ■ ■ ■ i Jm o- given sequence of index sets, product and summation may 
be interchanged as follows: 

m m m 

H zl " X II" S II" 

»=ijGJi j£Jix...xj m i=i ien™=iJt i=1 

where each j £ OfcLi <^fc * s a niultiindex of length \j\ = m. 

Using the notation of Lemma 1, we may also prove the following chain rule for 
higher order partial derivatives. 

Lemma 2 (Chain rule). If F : R d — > R d is a bijective and differ entiable map- 
ping (a diffeomorphism) between two coordinate systems, x = F(X), then 

*- e (n^W (12) 

5'e[l,d]i*l \fc=l k J 
for each multiindex 8, where D & x = Y[[=i an d D x = Ili=i ex , • 
Proof. By the standard chain rule and Lemma 1, we have 



s -pi- _d_ ji dXs L _d_ x ^ -pi- 9 
11 ^ 11 fl T! a^, Z^ 11 



i=l ' i=1.5'=l ! S'e[l,«f]l*l»=l ! ! 

t=l L t=l > 5'6[l,d]l 5 l \fe=l / 



(13) 



□ 

We may now prove the following representation theorem 2 for the element tensor. 

Theorem 1 (Representation theorem). If Fk is a given affine mapping 
from a reference element Kq to an element K and {V J K }^L 1 is a given set of discrete 
function spaces on K , each generated by a discrete function space on the reference 
element through the affine mapping, that is, for each 4> £ there is some $ £ V® 
such that $ = 4>oFk, then the element tensor (8) may be represented as the tensor 
contraction of a reference tensor A and a geometry tensor Gk, 

A K =A° : G K , (14) 

that is, 

A? = J24! a G K Wei, (15) 



2 A similar representation was derived and presented in [Kirby and Logg 2006] but in less formal 
notation. 
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where the reference tensor A is independent of K . In particular, the reference 
tensor A is given by 



n^ (a '"H(W)Ma,/3)]dX, 

PfcD C °j=l 



(16) 



and the geometry tensor Gk is the outer product of the coefficients of any weight 
functions with a tensor that depends only on the Jacobian Fk, 

2 ]5 i'^ )] dX 5 > (a0) 

j=l /3eB' j' = l k=l °j>k(a,PI 

for some appropriate index sets A, B and B' . We refer the the index set X as the 
set of primary indices, the index set A as the set of secondary indices, and to the 
index sets B and B' as auxiliary indices. 

Proof. Starting from (8), we may move the product of constant expansion co- 
efficients outside of the integral to obtain 



a— e / n c iW^ (7) <(t 7) K-(7)]d^ 



yec JK j=i 



TTl „ m 

EII c i'W/ n^ (7) <(i, 7 )[«i(7)]dx. 
7 ecj'=i jK j=i 



(18) 



We now make a change of variables through Fk, mapping coordinates X 6 Kq 
to coordinates x = Fk(X) 6 K, to carry out the integration on the reference 
element Kq. By Lemma 2, we thus obtain 



^=En^) n e n, 

7 ecj'=i " /Rr °J=i<5'e[i,d] l ^ lfc = 1 lk 



5A dX» 



x 



^i ( i )7) M7)] detF^dX 



— x 



(19) 



=e e n^)pn f 

-yeCs'eU7LiM w ^ =1 JKa j=ik=i °* k 
x C 5^(i, 7) K-(7)] deti^dX, 

where we have also used Lemma 1 to change the order of multiplication and summa- 
tion. Now, since the mapping Fk is afhne, the transforms ^ and the determinant 
are constant and may thus be pulled out of the integral. As a consequence, we 
obtain 

rn IV I dX g, 

^f=E e n^wdet^nn^f x 

■yec S 'eUr=iM w j ' =1 j"=i fc=i V* 



1J/;;<1»; k ; «-i ,i.v. 



A'o 3 - =1 
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The summation over C and YiiL d]> s '\ may now be rearranged as a summation 
over an index set B local to the terms of the integrand, a summation over an index 
set B' local to the terms outside of the integral, and a summation over an index set 
A common to all terms. We may thus express the element tensor A K as the tensor 
contraction 

A? = £ A° a G£, (21) 



where 



~ m 

Or- K? J Kd ■ 1 



l3eB jK ° j=l 



G K = J[ Cj (a) dot F' K J2 ] 



M«./3)l f)Y 



(22) 



0GB' f=l k=l dx ^' k (a,P) 



Note that since each coefficient Cj(a) in the geometry tensor Gk is always paired 
with a corresponding basis function in the reference tensor A , we were able to re- 
order the summation to move the coefficients outside of the summation over £>'. □ 

As demonstrated earlier in [Kirby and Logg 2006], the representation (14) in com- 
bination with prccomputation of the reference tensor ^4° may lead to very efficient 
computation of the element tensor A , with typical run-time speedups ranging 
between a factor 10 and a factor 1000 compared to standard run-time evaluation 
of the element tensor by numerical quadrature. The speedup is a direct result of 
the reduced operation count for the computation of the element tensor based on 
the tensor representation. In addition, one may omit multiplication with zeros and 
detect symmetries or other dependencies between the entries of the element tensor 
to further reduce the operation count as discussed in [Kirby ct al. 2005; Kirby et al. 
2006]. 

The rank of the reference tensor is determined both by the arity of the multilinear 
form and how the form is expressed as a product of coefficients and derivatives of 
basis functions. In general, the rank of the reference tensor is |i| + |a|, where |i| = r 
is the arity of the form and |a| is the rank of the geometry tensor. As a rule of 
thumb, the rank of the geometry tensor is the sum of the number of coefficients nc 
and the number of derivatives no appearing in the definition of the form, and thus 
the rank of the reference tensor for a bilinear form is 2 + nc + njj . Examples are 
given below in Section 4 for a set of test cases. 

2.3 Run-time evaluation of the tensor contraction 

This framework also maps onto using matrix-vector or matrix-matrix products at 
run-time. We may recast the tensor contraction (14) as a matrix-vector product for 
each K in the mesh. This involves first casting A as a matrix by labeling all the 
items of the index set I with integers in [1, \X\] and the index set A with integers 
in [1, |.4|]. This ordering of the multiindices i £ I corresponds to the rows of the 
matrix and the ordering of the multiindices a £ A corresponds to the columns. The 
same ordering is imposed on Gk to make it a vector. Furthermore, we may take a 
batch of elements T' C T and compute {A k }k£T' with a matrix-matrix product. 
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Currently, FFC supports code generation that sets up the matrix-vector products 
via level 2 BLAS, and computing batches of elements with matrix-matrix products 
via level 3 BLAS will be supported in a future version. 

2.4 Equivalence of reference tensors 

As noted earlier, forming the reference tensor is a dominant part of the cost for 
FFC when compiling code for the evaluation of multilinear forms. Before we proceed 
to discuss algorithms for efficiently evaluating the tensor for a monomial term in 
the next section, we conclude the discussion of form representation by noting that 
particular monomial terms have the same reference tensor but different geometry 
tensors. In such cases, the total cost may thus be reduced by recognizing the 
common reference tensor and only computing it once. 

As an example, consider each term of the two-dimensional Laplacian, 

_ f a <v' «€• ' 



where j = 1, 2 is the coordinate direction. By a suitable definition of the index set C, 
the sum of both terms may be phrased as a single canonical form (8) . We may also 
consider the two terms separately and write each term in the canonical form (8). 
After changing coordinates to the reference domain, we obtain the reference tensors 

/ ^^ dX , j = 1,2, (24) 



Ka dX ai dX 



and geometry tensors 



Note that the two terms of the form indeed have the same reference tensor but 
different geometry tensors. This has both compile-time and run-time implications. 
At compile-time, FFC should recognize this structure, hence building the reference 
tensor only once and generating code for a single geometry tensor that sums the 
basic parts. When the generated code is executed at run-time, this corresponds to 
fewer instructions and hence better performance. We may formalize this as follows. 

Theorem 2. Consider two multilinear variational forms with corresponding el- 
ement tensors and Bf of the form (5) defined over spaces {V^ 4 '^}?^ and 
{Vf ' K }j=i ■ respectively, that is, 

A ? = J K p. D * (26) 

Bf= I \\n; o ! ;, i) ^,. (27) 



K i=l 



Suppose that r a — rB = r, = , = and \5^\ = \S^\ for j = 1, 2, . . . , r. 
Then, the corresponding reference tensors are equal, that is, A° = B . Moreover, 
this relationship is an equivalence relation on the set of variational forms. 
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We remark that this result can be generalized slightly. If permuting the integrands 
of one form, say B, would lead to the hypotheses of this theorem being satisfied, 
then A and B° are the same after a similar permutation of the axes of B°. 

FFC recognizes and factors out common reference tensors by computing for each 
term of a given multilinear form a string that uniquely identifies the term. This 
may be accomplished by simply concatenating the names of the finite elements 
that generate the function spaces for the basis functions in the term, together with 
derivatives and component indices. We refer to such a string as a (hard) signature 
and note that the signature may be computed cheaply for each term by just looking 
at its canonical representation. We may then factor out common reference tensors 
by checking for equality of signatures. If two terms have the same signature, they 
also have a common reference tensor that may be factored out. 

FFC also computes a soft signature for each term, which is similar to a hard 
signature but disregarding ordering of multiindices. By checking for equality of soft 
signatures, it is possible to find terms which have reference tensors that only differ 
by the ordering of their axes. If two soft signatures match but the corresponding 
hard signatures differ, it is possible to find a reordering that results in equal hard 
signatures. FFC thus computes for each term a soft signature and if the soft 
signatures match for two terms, a suitable reordering is found, and the reference 
tensor may be factored out as before. In Table I, we include the hard and soft 
signatures for the bilinear form for Poisson's equation, a{v, u) — J n Vt> • X7udx. 



1 . OOOOOOOOOOOOOOOe+00* 

{Lagrange finite element of degree 1 on a triangle ; iO ;[]; [(d/dXaO)] }* 

{Lagrange finite element of degree 1 on a triangle ; il ;[]; [(d/dXal)] }*dX 

1 . 000000000000000e+00* 

{Lagrange finite element of degree 1 on a triangle ; iO ;[]; [(d/dXa)] }* 

{Lagrange finite element of degree 1 on a triangle ; il ;[]; [(d/dXa)] }*dX 



Table I. Hard signature (top) and soft signature (bottom) for the reference tensor of the bilinear 
form a(v,u) = J n Vu • Vda; with piecewise linear elements on triangles. Note that there arc no 
line breaks in the signatures. 



3. COMPUTING THE REFERENCE TENSOR 

Given a multilinear variational form, the form compiler FFC automatically gener- 
ates the canonical form (8) and the representation (14). The computationally most 
expensive part of this process is the computation of the reference tensor A , that 
is, the tabulation of each integral 








(28) 



for i gl and a G A. 
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As an example, consider the bilinear form 

a(v,u) = / v ■ (w ■ V)u dx, (29) 
Jn 

appearing in a linearization of the incompressible Navier-Stokes equations (see 
Section 4 below). Computing the 12 x 12 x 12 x 3 x 3 = 15,552 entries of the 
rank five reference tensor ^4° for piecewise linear elements on tetrahedra takes about 
9.5 seconds on a 3.0GHz Pentium 4 with FFC version 0.2.2. Since this computation 
only needs to be done once at compile-time, one may argue that this is no big issue. 
However, limitations on computer resources can be a limit to the complexity of the 
forms we can compile and the degree of polynomials we can use. Furthermore, long 
turn-around times to compile new, complex models diminish the usefulness of FFC 
as a tool for truly rapid development. 

We present below two very different ways to compute the reference tensor, first 
the obvious naive approach used in FFC version 0.2.2 and earlier versions, and then 
a more efficient algorithm used in FFC version 0.2.5 and beyond, which cuts the 
cost of computing the reference tensor by several orders of magnitude. In the case 
of the form (29) for piecewise linear elements on tetrahedra, the cost of computing 
the reference tensor is reduced from 9.5 seconds to around 0.02 seconds. 



3.1 Iterating over the entries of the reference tensor 

The obvious way to compute the reference tensor is to iterate over all indices of 
the reference tensor and compute each entry by quadrature over a suitable set of 
quadrature points {^k}k=i an d a corresponding set of quadrature weights {wk}^li 
on the reference element K , as outlined in Algorithm 1. Note that the iteration 
over multiindiccs a and (3 arc themselves multiply nested loops, however the length 
of a, (3 and hence the depth of the loop nest depends on the form being compiled. 
FFC uses the "collapscd-coordinate" Gauss- Jacobi rules described in [Karniadakis 
and Shcrwin 1999] based on taking tensor products of Gaussian integration rules 
over the square and cube and mapping them to the reference simplex. These are 
the arbitrary-order rules provided by FIAT. Since we are integrating polynomials, 
we may pick a quadrature rule which is exact for the total polynomial degree of 
the integrand. Alternatively, we can pick an approximate rule that is sufficiently 
accurate as per the theory of variational crimes [Ciarlet 1976; Brenner and Scott 
1994]. 

Algorithm 1 is expressed at a very low granularity — a loop over quadrature 
points for each entry of the reference tensor. The interpretive overhead associated 
with this algorithm explains the poor performance of earlier versions of FFC in a 
language such as Python. However, we may express the computation at a much 
higher level of granularity and leverage optimized libraries written in C, such as 
Python Numeric [Oliphant ct al. 2006]. In fact, we wind up with a loop over 
auxiliary indices and quadrature points, inside which the entire reference tensor 
is updated by an extended outer product. This higher abstraction dramatically 
improves performance while allowing us to remain in a high-level language. 
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Algorithm 1 A° = ComputeReferenceTensor() 

for i G X 

for a G A 
1 = 
for p e B 

for k = 1,2,..., N q 

i = i + Wk ut=i Df a,f}) Ka,<*,0) M«> w**) 

end for 
end for 

Ai a = I 

end for 
end for 



3.2 Assembling the reference tensor 

Algorithm 1 may be reorganized to significantly improve the performance. By first 
tabulating the basis functions at all quadrature points according to 

%p >ia = Df a ^^ a ^(a,(i)]{X k ), (30) 

which may be done efficiently using FIAT, we may improve the granularity of the 
computation by iterating over quadrature points {Xk}^^ and auxiliary indices B, 
assembling the contributions to the reference tensor from each pair (xk , P) , as out- 
lined in Algorithm 2 and Algorithm 3. 



Algorithm 2 A = AssembleReferenceTensor() 
for j = 1, 2, . . . , m 

W = Tabulate^/, {X k }%l v X, A, B, l 3 , S' jt k s ) 
end for 
A = 

for k = 1,2, ...,N g 
for $ e B 

A = A + ComputeProduct({* 3 '}^ 1 , k, /3) 
end for 
end for 



Algorithm 3 B = ComputeProduct({* J }^ 1; k, p) 
B = Wk 

for j = 1,2, ... ,m 

B = B (g> ^j.o (outer product) 
end for 
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The higher level of abstraction of Algorithm 2 allows us to simultaneously re- 
duce the interpretive overhead and make use of optimized libraries, such as the 
Python Numeric extension module. The accumulation of the outer products may 
be accomplished with the Numeric . add function, which is implemented in terms of 
efficient C loops over the low-level arrays. Moreover, the sequence of outer products 
is accumulated through calls to the function Numeric, multiply, outer. A sketch 
of the Python code corresponding to Algorithm 2 and Algorithm 3 is included in 
Table II. The full code can be downloaded from the FFC web page [Logg 2006]. 



# Iterate over quadrature points 
for k in range (num_points) : 

# Iterate over secondary indices 

for beta in B: 

# Compute cumulative outer product 
P = w[k] 

for j in range (m): 

P = Numeric .multiply . outer (P , Psi [...]) 

# Add to reference tensor 
Numeric. add(A0, P, AO) 



Tabic II. A sketch of the Python implementation of Algorithm 2 and Algorithm 3 in FFC. 

The situation is similar to that of the assembly of a global sparse matrix for a 
variational form; by separating the concerns of computing the local contribution 
(the element tensor) from the insertion of the local contribution into the global 
sparse matrix, we may optimize the two steps independently. In the former case, 
we call the optimized code generated by FFC to compute the element tensor and 
in the latter case, we may use an optimized library call such as the PETSc [Balay 
et al. 2006; Balay et al. 2004; Balay et al. 1997] call MatSetValues () . 

Moreover, a closer investigation of Algorithm 1 also reveals a source of redundant 
computation. As one entry in the multiindex a changes, most of the factors of the 
product in the innermost loop remain the same. This problem grows worse as the 
arity of the form and the number of derivatives increase. Although this is logically 
equivalent to a multiply nested loop, the structure of looping over an enumeration 
of multiindices makes it highly unlikely that an optimizing compiler would hoist 
invariants. However, Algorithm 3 includes the hoisting out of the arbitrary-depth 
loop nest. 

4. BENCHMARK RESULTS 

To measure the efficiency of the proposed Algorithm 2, we compute the reference 
tensor for a series of test cases, comparing FFC version 0.2.2, which is based on 
Algorithm 1, with FFC version 0.2.5, which is based on Algorithm 2. 

The benchmarks were obtained on an Intel Pentium 4 (3.0 GHz CPU, 2GB 
RAM) running Debian GNU/Linux with Python 2.4, Python Numeric 24.2-1 and 
FIAT 0.2.3. The numbers reported are the CPU times/second for the prccompu- 
tation of the reference tensor, which was previously the main bottle-neck in the 
compilation of a form. With the new and more efficient precomputation of the 
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reference tensor in FFC, the precomputation is no longer a bottle-neck and is in 
some cases dominated by the cost of code generation. 

4.1 Test cases 

We take as test cases the computation of the reference tensor for the set of bilinear 
forms used to benchmark the run-time performance of the code generated by FFC 
in [Kirby and Logg 2006]. For convenience, we choose a common discrete function 
space Vi = V% = ■ . . = V n = V for all basis functions, but there is no such limitation 
in FFC; function spaces can be mixed freely. 

We also add a fifth test case which is a more demanding problem posing real 
difficulties for earlier versions of FFC based on Algorithm 1. 

Test case 1: the mass matrix. As a first test case, we consider the computation 
of the mass matrix M with M ili2 = a{<p\ i , (/)? ) and the bilinear form a given by 



a(v, u) = / vudx. (31) 
Jn 

The corresponding rank two reference tensor takes the form 

4 ? = / SiiSfedX, (32) 



with the rank zero geometry tensor given by Gk = det F' K . 

Test case 2: Poisson's equation. As a second example, consider the bilinear form 
for Poisson's equation, 

a(v, u) = I Vu-Vwdx. (33) 
Jn 

The corresponding rank four reference tensor takes the form 

A °° = J < 34 > 

JK Q OA. ai 0A a2 

with the rank two geometry tensor given by G% = deti^ Y^p=i ~Ux^~^J~' 

Test case 3: Navier- Stokes. We consider next the nonlinear term u ■ Vu of the 
incompressible Navier-Stokes equations, 

u + u ■ V« — vAu + Vp = f, , „ 

35 

V-u = 0. 

Linearizing this term as part of either a Newton or fixed-point based solution 
method (see for example [Eriksson et al. 2003; Hoffman and Johnson 2004]), we 
need to evaluate the bilinear form 



a(v, u) = a w (v, u) — / v ■ (w ■ V)udx. (36) 
Jn 

The corresponding rank five reference tensor takes the form 

A l = J2[ QiAmM^^-dX, (37) 
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with the rank three geometry tensor given by Gf^ = det F' K w^ i g^° 3 ■ 

Test case J^: linear elasticity. As our next test case, we consider the strain-strain 
term of linear elasticity [Brenner and Scott 1994], 

a(v, u) = I i(Vv + (Vw) T ) : (Vu + (Vw) T ) dx 
Jn 4 

1 dvi dui 1 dvi duj 1 dvj dui 1 dvj duj 
rr^ x 4 dxj dxj 4 dxj dxi 4 dxi dxj 4 dxi dxi (38) 

1 dv.j dui 1 dvi dv,j 



1 UUi UUl 1 UU% UUj 

rr±. 2 9xj dxj 2 c>Xj 



»,J=1 

Considering here for simplicity only the first of the two terms 3 , the rank four 
reference tensor takes the form 

with the rank two geometry tensor given by Gg- = | det F' K X]/3=i i^p 9 ^xp • 

Test case 5: stabilization. As a final test case, we consider the bilinear form 
for a stabilization term appearing in a least-squares stabilized cG(l)cG(l) method 
for the incompressible Navier-Stokes equations [Eriksson et al. 2003; Hoffman and 
Johnson 2004]: 

f ^ ()lJ \'t\ QUL \'b\ 

a(v,u)= / (w ■ Vv) ■ (w ■ Vu) dx = V u;[j]-^w[A:]— dx. (40) 

J SI T7*' , "^j "^fc 

The corresponding rank eight reference tensor takes the form 

Al = ±[ * Q > 3 ]«* Q2 M«dX, (41) 

with the rank six geometry tensor given by G^ = w^w^ det F' K g^° B 6 ■ As 
a consequence of the high rank of the reference tensor, the computation of the 
reference tensor is very costly. For piecewise linear basis functions on tetrahedra 
with 4 x 3 = 12 basis functions on the reference element, the number of entries in 
the reference tensor is 12 x 12 x 12 x 12 x 3 x 3 x 3 x 3 = 1, 679, 616. 

4.2 Results 

In Table III, we present a summary of the speedups obtained with Algorithm 2 
(FFC version 0.2.5) compared to Algorithm 1 (FFC version 0.2.2). Detailed results 
are given in Figures 1-4 for test cases 1-4. Because of limitations in the earlier 
version of FFC, that is, the poor performance of Algorithm 1, the comparison is 
made for polynomial degree q < 8 in test cases 1-2, q < 3 in test cases 3-4 and 



3 The benchmark measures the time to compute the reference tensors for both terms. 
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q = 1 in test case 5. Higher degree forms may be compiled with FFC version 0.2.5, 
but even then the memory requirements for storing the reference tensor may in 
some cases exceed the available 2GB on the test system. 

As evident from Table III, the speedup is significant in most test cases, typically 
one or two orders of magnitude. In test case 5, the stabilization term in Navier- 
Stokes, the speedup is as large as three orders of magnitude. 



Test case 


9 = 1 


9 = 2 


g = 3 


9 = 4 


9 = 5 


9 = 6 


9 = 7 


9 = 8 


1. Mass matrix 2D 


1.4 


2.6 


4.0 


5.6 


7.6 


9.9 


12.5 


15.2 


1. Mass matrix 3D 


1.6 


3.5 


6.4 


10.8 


16.9 


23.1 


28.3 


20.9 


2. Poisson 2D 


2.5 


7.0 


11.4 


16.4 


21.9 


27.5 


33.5 


39.4 


2. Poisson 3D 


7.4 


19.3 


33.8 


47.8 


43.8 


38.8 


28.1 


23.1 


3. Navier-Stokes 2D 


67.2 


264.3 


239.0 












3. Navier-Stokes 3D 


461.3 


291.7 


82.3 












4. Elasticity 2D 


20.2 


44.3 


68.9 












4. Elasticity 3D 


142.5 


230.7 


138.0 












5. Stabilization 2D 


1114.7 
















5. Stabilization 3D 


1101.4 

















Table III. Speedups for test cases 1—5 in 2D and 3D for different polynomial degrees q of Lagrange 
basis functions. 




Fig. 1. Compilation time as function of polynomial degree q for test case 1, the mass matrix, 
specified in FFC by a = v*u*dx. 
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Fig. 2. Compilation time as function of polynomial degree q for test case 2, Poisson's equation, 
specified in FFC by a = v.dx(i)*u.dx(i)*dx. 




Fig. 3. Compilation time as function of polynomial degree q for test case 3, the nonlinear term of 
the incompressible Navier— Stokes equations, specified in FFC by a = v [i] *w[j] *u [i] .dx(j)*dx. 
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Fig. 4. Compilation time as function of polynomial degree q for test case 4, the strain-strain term 
of linear elasticity, specified in FFC by a = . 25* (v [i] .dx(j ) + v[j].dx(i)) * (u[i] .dx(j) + 
u[j] .dx(i) ) * dx. 



5. CONCLUSION 

The new improved precomputation of the reference tensor removes an outstanding 
bottleneck in the compilation of variational forms. This improves the possibilities 
of using FFC as a tool for rapid prototyping and development. The feature set for 
FFC is also quickly expanding, with an expanded form language, recently added 
support for arbitrary mixed formulations, and with built-in support for functionals, 
nonlinear formulations and error estimates on the horizon. At the same time, FFC 
is still very much a test-bed for basic research in efficient evaluation of general 
variational forms. 
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